MCIC Wooster, OSU
2024-03-27
What are the advantages of creating such a pipeline or workflow, rather than running scripts one-by-one as needed?
Rerunning everything is much easier.
Re-applying the same set of analyses in a different project is much easier.
Ensuring you are including all necessary steps.
The pipeline is a form of documentation of the steps taken.
Improving reproducibility, e.g. making it easier for others to repeat your analysis.
Rerunning part of the pipeline may be necessary, e.g. after:
Some scripts fail for all or some samples.
Adding a sample.
Needing to modify a script or settings somewhere halfway through the pipeline.
Batch job submissions pose problems:

# (Assuming each script takes its output file as a second argument;
#  redirecting sbatch's stdout would only capture the "Submitted batch job" message.)
for sample in ${samples[@]}; do
    sbatch scripts/trim.sh data/"$sample".fq res/"$sample"_trim.fq
    # Problem: sbatch returns immediately, so this mapping job
    # may start before the trimming job has finished!
    sbatch scripts/map.sh res/"$sample"_trim.fq res/"$sample".bam
done
# Problem: this job needs *all* BAM files, yet earlier jobs may still be running
sbatch scripts/count.sh ${samples[@]} res/count_table.txt

Such problems can be worked around with if statements, many arguments to scripts, and Slurm "job dependencies", but this becomes very hard to manage for more complex workflows. Typically, researchers codify workflows using general scripting languages such as Python or Bash, but these often lack the necessary flexibility.
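As a sketch of the "job dependencies" workaround: Slurm's `sbatch --parsable` flag prints just the job ID of the submitted job, which can then be passed to `--dependency` (the script and file names below are the hypothetical ones from the loop above):

```shell
# Submit the trimming job and capture its job ID
trim_id=$(sbatch --parsable scripts/trim.sh data/sampleA.fq res/sampleA_trim.fq)

# This job is held until the trimming job has finished successfully
sbatch --dependency=afterok:"$trim_id" scripts/map.sh res/sampleA_trim.fq res/sampleA.bam
```

This already takes two dependency hand-offs per sample; with dozens of samples and branching steps, the bookkeeping quickly becomes unmanageable.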
Workflows can involve hundreds to thousands of data files; a pipeline must be able to monitor their progress and exit gracefully if any step fails. And pipelines must be smart enough to work out which tasks need to be re-executed and which do not.
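The "which tasks need to be re-executed" logic is essentially timestamp bookkeeping, which make-style tools and workflow managers automate. A minimal hand-rolled sketch (file names are made up):

```shell
# Minimal sketch of "only re-run what's needed": a step must be re-run
# if its output file is missing, or older than its input file.
needs_rerun() {
    local infile=$1 outfile=$2
    [ ! -e "$outfile" ] || [ "$infile" -nt "$outfile" ]
}

touch input.txt
if needs_rerun input.txt output.txt; then
    echo "step 1: (re)running"        # output.txt doesn't exist yet
    cp input.txt output.txt
fi
if needs_rerun input.txt output.txt; then
    echo "step 1: (re)running"
else
    echo "step 1: up to date"         # output is now at least as new as input
fi
rm input.txt output.txt
```

Real workflow tools go further (e.g. also re-running when parameters or the script itself changed), but the principle is the same.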
Workflow tools, often called “workflow management systems”, provide ways to formally describe and execute workflows/pipelines.
Some of the most commonly used (command-line based) options in bioinformatics are Snakemake and Nextflow.
Automation
Reproducibility and transparency
Flexibility, portability and scalability, by separating the generic nuts-and-bolts of the pipeline from run-specific configuration such as input files and the computing environment.
Most workflow tools are small languages (domain-specific languages, DSLs), often a sort of extension of a more general language (Python for Snakemake, Groovy for Nextflow).
Learning one of these to write your own workflows is probably only worth it if you expect to work on genomics/bioinformatics projects fairly regularly.
But there are also publicly available and well-documented workflows written by others that you can use.
The “nf-core” initiative (https://nf-co.re) curates a set of best-practice, flexible, and well-documented Nextflow workflows.
It also has some of its own tooling built on top of Nextflow to help contributors create more robust and user-friendly workflows.
nf-core currently has 58 complete pipelines; four of these are by far the most popular.
We will run the nf-core rnaseq pipeline (https://nf-co.re/rnaseq), a reference-based RNA-seq pipeline that takes FASTQ and reference genome files as inputs to produce a gene count table and many “QC outputs”.
Running a pipeline like this is a bit different (and more involved) than running a typical piece of bioinformatics software — it is, after all, stitching together many operations. Some considerations:
We only need to install Nextflow itself: the workflow files are merely downloaded, not installed.
The pipeline runs all its constituent tools via Singularity containers (most common & recommended) or via Conda environments.
When run with the -resume option, the workflow will check what needs to be (re)run and what doesn't: this is quite intricate but generally works well.
The nf-core rnaseq pipeline is meant for RNA-seq projects that:
That might seem quite specific, but this is by far the most common RNA-seq use case.
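To illustrate, an invocation might look something like the following (the file names are placeholders; --input, --outdir, --fasta and --gtf are nf-core rnaseq parameters, while -profile and -resume are Nextflow options):

```shell
nextflow run nf-core/rnaseq \
    -profile singularity \
    --input samplesheet.csv \
    --outdir results \
    --fasta genome.fa \
    --gtf annotation.gtf \
    -resume
```

Note the convention that Nextflow's own options take a single dash, whereas pipeline parameters take a double dash.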
There are two main parts to the kind of RNA-seq data analysis I just described:
Read QC and pre-processing
Read QC (FastQC)
Adapter and quality trimming (TrimGalore)
Optional removal of rRNA (SortMeRNA) — off by default, but we will include this
Alignment & quantification
Alignment to the reference genome/transcriptome (STAR)
Gene expression quantification (Salmon)
Post-processing, QC, and reporting
Post-processing alignments: sort, index, mark duplicates
Alignment/count QC (RSeQC, Qualimap, dupRadar, Preseq, DESeq2)
Create a report (MultiQC)
This pipeline is quite flexible and you can turn several steps off, add optional steps, and change individual options for most tools that the pipeline runs.
This part is “bioinformatics-heavy” with large files, a need for lots of computing power, and command-line (Unix shell) programs — it specifically involves:
Read pre-processing
Aligning reads to a reference genome
Quality control (QC) of both the reads and the alignments
Quantifying expression levels
Read pre-processing includes the following steps: read QC (FastQC), adapter and quality trimming, and optionally the removal of rRNA-derived reads.
The alignment of reads to a reference genome needs to be “splice-aware”.
Alternatively, you can align to the transcriptome (i.e., all mature transcripts), in which case splice-aware alignment is not needed, since mature transcripts no longer contain introns.
Here are some examples of the kinds of things to pay attention to:
At heart, quantification is a simple counting exercise once you have the alignments in hand.
But made more complicated by sequencing biases and multi-mapping reads.
Current best-performing tools (e.g. Salmon) do transcript-level quantification — even though this is typically followed by gene-level aggregation prior to downstream analysis.
Fast-moving field
Several very commonly used tools like featureCounts (>15k citations) and HTSeq (>18k citations) have become disfavored in the past couple of years, e.g. because they don't count multi-mapping reads at all.
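To illustrate the "simple counting" core of quantification, here is a toy tally of reads per gene from a made-up read-to-gene assignment table (real tools work from BAM files and additionally model biases and multi-mapping reads):

```shell
# Toy read-to-gene assignment table (tab-separated: read ID, gene ID)
printf 'read1\tgeneA\nread2\tgeneB\nread3\tgeneA\nread4\tgeneA\n' > assignments.tsv

# Counting = tallying the number of reads assigned to each gene
counts=$(cut -f 2 assignments.tsv | sort | uniq -c | sort -rn)
echo "$counts"
#   3 geneA
#   1 geneB

rm assignments.tsv
```

The hard part of real quantification is not this tally, but deciding how each read should be assigned in the first place.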
The second part of RNA-seq data analysis involves analyzing the count table.
In contrast to the first part, this can be done on a laptop and instead is heavier on statistics, data visualization and biological interpretation.
It is typically done with the R language, and common steps include:
Principal Component Analysis (PCA)
Assessing overall sample clustering patterns
Differential Expression (DE) analysis
Finding genes that differ in expression level between sample groups (DEGs)
Functional enrichment analysis
See whether certain gene function groupings are over-represented among DEGs
params.greeting = 'Hello world!'
greeting_ch = Channel.of(params.greeting)

process SPLITLETTERS {
    input:
    val x

    output:
    path 'chunk_*'

    script:
    """
    printf '$x' | split -b 6 - chunk_
    """
}

process CONVERTTOUPPER {
    input:
    path y

    output:
    stdout

    script:
    """
    cat $y | tr '[a-z]' '[A-Z]'
    """
}

workflow {
    letters_ch = SPLITLETTERS(greeting_ch)
    results_ch = CONVERTTOUPPER(letters_ch.flatten())
    results_ch.view { it }
}
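To see what the two processes actually do, their commands can be mimicked directly in a plain shell, outside Nextflow:

```shell
# Mimic SPLITLETTERS: split the 12-byte greeting into 6-byte chunks
# (this creates the files chunk_aa and chunk_ab in the working directory)
printf 'Hello world!' | split -b 6 - chunk_

# Mimic CONVERTTOUPPER: uppercase the contents of each chunk
result=$(cat chunk_* | tr '[a-z]' '[A-Z]')
echo "$result"   # HELLO WORLD!

rm chunk_*       # clean up
```

In the real workflow, Nextflow additionally runs CONVERTTOUPPER once per chunk (because of the flatten() call) and takes care of staging the chunk files between the two processes.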